 splice site


DeepVRegulome: DNABERT-based deep-learning framework for predicting the functional impact of short genomic variants on the human regulome

Dutta, Pratik, Obusan, Matthew, Sathian, Rekha, Chao, Max, Surana, Pallavi, Papineni, Nimisha, Ji, Yanrong, Zhou, Zhihan, Liu, Han, Yurovsky, Alisa, Davuluri, Ramana V

arXiv.org Artificial Intelligence

Whole-genome sequencing (WGS) has revealed numerous non-coding short variants whose functional impacts remain poorly understood. Despite recent advances in deep-learning genomic approaches, accurately predicting and prioritizing clinically relevant mutations in gene regulatory regions remains a major challenge. Here we introduce DeepVRegulome, a deep-learning method for prediction and interpretation of functionally disruptive variants in the human regulome, which combines 700 DNABERT fine-tuned models, trained on vast amounts of ENCODE gene regulatory regions, with variant scoring, motif analysis, attention-based visualization, and survival analysis. We showcase its application on the TCGA glioblastoma WGS dataset in prioritizing survival-associated mutations and regulatory regions. The analysis identified 572 splice-disrupting and 9,837 transcription-factor binding site altering mutations occurring in greater than 10% of glioblastoma samples. Survival analysis linked 1,352 mutations and 563 disrupted regulatory regions to patient outcomes, enabling stratification via non-coding mutation signatures. All the code, fine-tuned models, and an interactive data portal are publicly available.


CircFormerMoE: An End-to-End Deep Learning Framework for Circular RNA Splice Site Detection and Pairing in Plant Genomes

Jiang, Tianyou

arXiv.org Artificial Intelligence

Circular RNAs (circRNAs) are important components of the non-coding RNA regulatory network. Previous circRNA identification primarily relies on high-throughput RNA sequencing (RNA-seq) data combined with alignment-based algorithms that detect back-splicing signals. However, these methods face several limitations: they cannot predict circRNAs directly from genomic DNA sequences and rely heavily on RNA experimental data; they involve high computational costs due to complex alignment and filtering steps; and they are inefficient for large-scale or genome-wide circRNA prediction. The challenge is even greater in plants, where circRNA splice sites often lack the canonical GT-AG motif seen in human mRNA splicing, and no efficient deep learning model with strong generalization capability currently exists. Furthermore, the number of currently identified plant circRNAs is likely far lower than their true abundance. In this paper, we propose a deep learning framework named CircFormerMoE, based on transformers and mixture-of-experts, for predicting circRNAs directly from plant genomic DNA. Our framework consists of two subtasks: splicing site detection (SSD) and splicing site pairing (SSP). The model's effectiveness has been validated on gene data from 10 plant species. Trained on known circRNA instances, it is also capable of discovering previously unannotated circRNAs. In addition, we performed interpretability analyses on the trained model to investigate the sequence patterns contributing to its predictions. Our framework provides a fast and accurate computational method and tool for large-scale circRNA discovery in plants, laying a foundation for future research in plant functional genomics and non-coding RNA annotation.


Horizon-wise Learning Paradigm Promotes Gene Splicing Identification

Li, Qi-Jie, Sun, Qian, Zhang, Shao-Qun

arXiv.org Artificial Intelligence

Identifying gene splicing is a core and significant task confronted in modern collaboration between artificial intelligence and bioinformatics. Past decades have witnessed great efforts on this concern, such as the bio-plausible splicing pattern AT-CG and the famous SpliceAI. In this paper, we propose a novel framework for the task of gene splicing identification, named Horizon-wise Gene Splicing Identification (H-GSI). The proposed H-GSI follows the horizon-wise identification paradigm and comprises four components: the pre-processing procedure transforming string data into tensors, the sliding window technique handling long sequences, the SeqLab model, and the predictor. In contrast to existing studies that process gene information with a truncated fixed-length sequence, H-GSI employs a horizon-wise identification paradigm in which all positions in a sequence are predicted with only one forward computation, improving accuracy and efficiency. The experiments conducted on the real-world Human dataset show that our proposed H-GSI outperforms SpliceAI and achieves the best accuracy of 97.20%. The source code is available from this link.
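
The first two components named in the abstract (turning sequence strings into tensors, and splitting long sequences with a sliding window) can be illustrated with a minimal sketch. This is not the H-GSI implementation: the function names, the A/C/G/T channel order, the all-zero encoding for unknown bases, and the way the final window is shifted back to cover the tail are all assumptions made for illustration.

```python
# Illustrative sketch only; names and conventions are hypothetical,
# not taken from the H-GSI source code.
ALPHABET = "ACGT"  # assumed channel order


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T).

    Bases outside the alphabet (e.g. N) are encoded as all zeros.
    """
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def sliding_windows(seq, window, stride):
    """Yield (start, subsequence) pairs that together cover seq.

    Windows advance by `stride`; the last window is aligned to the end
    of the sequence so that no trailing positions are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(seq) <= window:
        yield 0, seq
        return
    start = 0
    while start + window < len(seq):
        yield start, seq[start:start + window]
        start += stride
    yield len(seq) - window, seq[len(seq) - window:]


if __name__ == "__main__":
    for start, chunk in sliding_windows("ACGTNACGTA", window=4, stride=3):
        print(start, chunk, one_hot(chunk)[0])
```

In a per-position setup such as the one the abstract describes, each window would be passed through the model once and a label predicted for every position in it; predictions from overlapping windows would then need to be merged, a step this sketch does not cover.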


Identifying DNA Sequence Motifs Using Deep Learning

Poddar, Asmita, Uzun, Vladimir, Tunbridge, Elizabeth, Haerty, Wilfried, Nevado-Holgado, Alejo

arXiv.org Artificial Intelligence

Advancements in genomic technologies [22] have enabled the generation of vast amounts of DNA sequence data, enabling the structural and functional study of the human genome [3]. This has been valuable in providing insights into the genetic basis of diseases. Since splice sites play a crucial role in this, accurately predicting these sites in DNA sequences is essential for diagnosing diseases. The internal structure of DNA sequences consists of alternating protein-coding regions that contain information for the production of proteins, called exons, and non-coding regions that interrupt the coding sequence, called introns, as shown in Figure 1. Mutations in the intronic region have been associated with developmental disorders such as isolated Pierre Robin sequence [28] and several types of cancer [31]. Hence, the accurate identification of boundaries between the exons and introns, known as splice sites, has special biological significance with healthcare implications, such as in studying the sources of unresolved genetic disease. Current annotations of the human genome that identify where exons and introns are located in the sequence [10] are far from complete, and much remains unknown about the DNA sequence motifs that indicate the presence of a splice site. We aim to address this challenge computationally by predicting the presence of splice sites -- both the exon start (intron end) sites, known as acceptor sites, and the exon end (intron start) sites, known as donor sites -- within the DNA sequence.
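
To make the donor and acceptor definitions concrete, the sketch below enumerates candidate sites using the canonical GT-AG rule, under which most introns begin with GT and end with AG. This is a naive baseline for illustration, not the deep learning approach the paper describes: it flags every GT or AG dinucleotide, most of which are not real splice sites, which is why learned sequence motifs are needed. The function name and the 0-based coordinate convention are assumptions.

```python
# Naive GT-AG scan for illustration; the function name and coordinate
# convention are hypothetical and not part of the paper's method.
def candidate_splice_sites(seq):
    """Return (donors, acceptors) as lists of 0-based positions.

    donors:    index of the G in each GT, i.e. where an intron would start.
    acceptors: index just after each AG, i.e. where the next exon would start.
    """
    seq = seq.upper()
    donors, acceptors = [], []
    for i in range(len(seq) - 1):
        dinucleotide = seq[i:i + 2]
        if dinucleotide == "GT":
            donors.append(i)
        elif dinucleotide == "AG":
            acceptors.append(i + 2)
    return donors, acceptors


if __name__ == "__main__":
    donors, acceptors = candidate_splice_sites("CAGGTAAGTTTCAGGC")
    print("candidate donor positions:", donors)
    print("candidate acceptor positions:", acceptors)
```

A classifier for this task would typically take a window of sequence around each position, candidate or otherwise, and output the probability that it is a true donor or acceptor site.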


Transcriptomic signatures across human tissues identify functional rare genetic variation

Science

Every human genome contains tens of thousands of rare genetic variants (which include single nucleotide changes, insertions or deletions, and larger structural variants), and some may have a functional effect. Ferraro et al. examined data from individuals in the Genotype-Tissue Expression (GTEx) project for outliers across tissues caused by gene expression, splicing, and allele-specific expression. Single rare variants were observed that affected the expression and allele-specific expression of multiple genes and, in the case of a gene fusion event, splicing. Experimental and computational validation suggest that many individuals carry more than 50 rare variants that affect transcription in some way. Although most variants were predicted to not affect an individual's phenotype, a small percentage showed likely disease-related associations, emphasizing the importance of studying the impact of rare genetic variation on the transcriptome. Science, this issue p. eaaz5900 (DOI: 10.1126/science.aaz5900)

### INTRODUCTION

The human genome contains tens of thousands of rare (minor allele frequency <1%) variants, some of which contribute to disease risk. Using 838 samples with whole-genome and multitissue transcriptome sequencing data in the Genotype-Tissue Expression (GTEx) project version 8, we assessed how rare genetic variants contribute to extreme patterns in gene expression (eOutliers), allelic expression (aseOutliers), and alternative splicing (sOutliers). We integrated these three signals across 49 tissues with genomic annotations to prioritize high-impact rare variants (RVs) that associate with human traits.

### RATIONALE

Outlier gene expression aids in identifying functional RVs. Transcriptome sequencing provides diverse measurements beyond gene expression, including allele-specific expression and alternative splicing, which can provide additional insight into RV functional effects.

### RESULTS

After identifying multitissue eOutliers, aseOutliers, and sOutliers, we found that outlier individuals of each type were significantly more likely to carry an RV near the corresponding gene. Among eOutliers, we observed strong enrichment of rare structural variants. sOutliers were particularly enriched for RVs that disrupted or created a splicing consensus sequence. aseOutliers provided the strongest enrichment signal when evaluated from just a single tissue. We developed Watershed, a probabilistic model for personal genome interpretation that improves over standard genomic annotation-based methods for scoring RVs by integrating these three transcriptomic signals from the same individual and replicates in an independent cohort. To assess whether outlier RVs identified in GTEx associate with traits, we evaluated these variants for association with diverse traits in the UK Biobank, the Million Veterans Program, and the Jackson Heart Study. We found that transcriptome-assisted prioritization identified RVs with larger trait effect sizes and were better predictors of effect size than genomic annotation alone.

### CONCLUSION

With >800 genomes matched with transcriptomes across 49 tissues, we were able to study RVs that underlie extreme changes in the transcriptome. To capture the diversity of these extreme changes, we developed and integrated approaches to identify expression, allele-specific expression, and alternative splicing outliers, and characterized the RV landscape underlying each outlier signal. We demonstrate that personal genome interpretation and RV discovery is enhanced by using these signals. This approach provides a new means to integrate a richer set of functional RVs into models of genetic burden, improve disease gene identification, and enable the delivery of precision genomics.

Figure: Transcriptomic signatures identify functional rare genetic variation. We identified genes in individuals that show outlier expression, allele-specific expression, or alternative splicing and assessed enrichment of nearby rare variation. We integrated these three outlier signals with genomic annotation data to prioritize functional RVs and to intersect those variants with disease loci to identify potential RV trait associations.

Rare genetic variants are abundant across the human genome, and identifying their function and phenotypic impact is a major challenge. Measuring aberrant gene expression has aided in identifying functional, large-effect rare variants (RVs). Here, we expanded detection of genetically driven transcriptome abnormalities by analyzing gene expression, allele-specific expression, and alternative splicing from multitissue RNA-sequencing data, and demonstrate that each signal informs unique classes of RVs. We developed Watershed, a probabilistic model that integrates multiple genomic and transcriptomic signals to predict variant function, validated these predictions in additional cohorts and through experimental assays, and used them to assess RVs in the UK Biobank, the Million Veterans Program, and the Jackson Heart Study. Our results link thousands of RVs to diverse molecular effects and provide evidence to associate RVs affecting the transcriptome with human traits.


Developing parsimonious ensembles using ensemble diversity within a reinforcement learning framework

Stanescu, Ana, Pandey, Gaurav

arXiv.org Machine Learning

Heterogeneous ensembles built from the predictions of a large number of diverse base predictors are a potent approach to building predictive models for problems where the ideal base/individual predictor may not be obvious. Ensemble selection is an especially promising approach here, not only for improving prediction performance, but also because of its ability to select a collectively predictive subset, often a relatively small one, of the base predictors. In this paper, we present a set of algorithms that explicitly incorporate ensemble diversity, a known factor influencing predictive performance of ensembles, into a reinforcement learning framework for ensemble selection. We rigorously tested these approaches on several challenging problems and associated data sets, and found that several of them produced more accurate ensembles than approaches that do not explicitly consider diversity. More importantly, these diversity-incorporating ensembles were much smaller in size, i.e., more parsimonious, than the ensembles that ignore diversity. This can eventually aid the interpretation or reverse engineering of predictive models assimilated into the resultant ensemble(s).


mTim: Rapid and accurate transcript reconstruction from RNA-Seq data

Zeller, Georg, Goernitz, Nico, Kahles, Andre, Behr, Jonas, Mudrakarta, Pramod, Sonnenburg, Soeren, Raetsch, Gunnar

arXiv.org Machine Learning

High-throughput sequencing technology applied to cellular mRNA (RNA-Seq) has revolutionized transcriptome studies [19, 17, 35, among many others]. In contrast to microarray platforms, which it has replaced in many applications, RNA-Seq can not only be used to accurately quantify known transcripts, but also to reveal the precise structure of transcripts at single-nucleotide resolution. RNA-Seq based transcript reconstruction has therefore become a valuable tool for the completion of genome annotations [22, 11, for instance] and further enabled subsequent analyses of differentially expressed genes [2], transcript isoforms [6, 4] and exons [3], all of which generally rely on correctly inferred transcript inventories. De novo transcript reconstruction is thus a pivotal step in the analysis of RNA-Seq data. There are two conceptually different strategies to approach this problem: one can either assemble transcripts directly from RNA-Seq reads using methodology that originated from genome assembly approaches [13, 23, 25].


An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

Schweikert, Gabriele, Rätsch, Gunnar, Widmer, Christian, Schölkopf, Bernhard

Neural Information Processing Systems

We study the problem of domain transfer for a supervised classification task in mRNA splicing. We consider a number of recent domain transfer methods from machine learning, including some that are novel, and evaluate them on genomic sequence data from model organisms of varying evolutionary distance. We find that in cases where the organisms are not closely related, the use of domain adaptation methods can help improve classification performance.